========================================================

1 Introduction

This project is an exploratory data analysis of campaign contributions for the 2012 presidential election in the state of Ohio. The state of Ohio is historically a critical swing state for presidential elections and, according to wikipedia, has the longest streak of matching the overall election outcome (since 1960). It would be a mistake to assume that campaign contributions are a predictor or even loosely correlated with votes (I might be biting my tongue if we were talking about superPACs though), but Ohio is one of the most frequently visited states on a presidential nominee’s campaign trail and for this reason it piqued my interest. I’ll explore the nature of campaign contributions and see if there are any interesting relationships in the data. I’ll also take a look at the geographic distribution of campaign contributions by zipcode. This exploration may shed light on the characteristics of a generous contributor and perhaps why particular party candidates visit certain areas in the state. Anything of significance could be used in future inferential or predictive analysis.


2 Load Data

There are 3 data sets here:

  • campaign contribution data
  • geographic data (to generate maps)
  • demographic data (to add extra features)

We’ll link these datasets by zipcode much like tables in a database. The geographic and demographic data come from a library and are already nicely formatted. The features in the campaign contribution data, however, might need some explicit class declarations

Let’s peak at the first few rows to inspect the atomic classes (aka types).

df_look <- read.csv("campaign-contributions-ohio-2012.csv", nrows = 100)
## Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed

While trying to read the file, we get some kind of error about the row names. Let’s explicitly set row.names = NULL to see if that fixes it.

df_look <- read.csv("campaign-contributions-ohio-2012.csv", nrows = 100,
                    row.names=NULL)
##   row.names   cmte_id       cand_id              cand_nm    contbr_nm
## 1 C00495820 P80000748     Paul, Ron    BROWN, TODD W MR.      BATAVIA
## 2 C00495820 P80000748     Paul, Ron   DIEHL, MARGO SONJA       CANTON
## 3 C00495820 P80000748     Paul, Ron KIRCHMEYER, BENJAMIN NORTH CANTON
## 4 C00431445 P80003338 Obama, Barack       KEYES, STEPHEN       BEXLEY
## 5 C00431445 P80003338 Obama, Barack       MURPHY, MIKE W     COLUMBUS
## 6 C00431445 P80003338 Obama, Barack     ASHMAN, LEWIS J.       DAYTON
##   contbr_city contbr_st                      contbr_zip
## 1          OH 451034017                GENERAL ELECTRIC
## 2          OH     44718                            NONE
## 3          OH     44720                    DIEBOLD, INC
## 4          OH 432091491 NATIONWIDE MUTUAL INSURANCE CO.
## 5          OH     43214               ROCK TENN COMPANY
## 6          OH 454091226         KETTERING SCHOOL SYSTEM
##           contbr_employer contbr_occupation contb_receipt_amt
## 1                ENGINEER              50.0         06-DEC-11
## 2                 RETIRED              25.0         06-DEC-11
## 3     COMPUTER PROGRAMMER             201.2         06-DEC-11
## 4 HR EXECUTIVE / ATTORNEY             100.0         30-SEP-11
## 5                 MANAGER              50.0         26-SEP-11
## 6                 TEACHER              50.0         27-SEP-11
##   contb_receipt_dt receipt_desc memo_cd memo_text form_tp  file_num
## 1               NA           NA      NA     SA17A  779227   0925592
## 2               NA           NA      NA     SA17A  779227   0925663
## 3               NA           NA      NA     SA17A  779227   0925696
## 4               NA           NA      NA     SA17A  756218 C12432798
## 5               NA           NA      NA     SA17A  756218 C12223508
## 6               NA           NA      NA     SA17A  756218 C12234248
##   tran_id election_tp
## 1   P2012          NA
## 2   P2012          NA
## 3   P2012          NA
## 4   P2012          NA
## 5   P2012          NA
## 6   P2012          NA

By setting the row names to NULL we forced row numbering but for some reason it used an existing column as the row number. Using head to look at the first 6 lines, it seems like there is a mismatch between the header and the rest of the data. Let’s manually open a connection to the file and read a few lines including the header to see if we can pinpoint the problem.

con <- file("campaign-contributions-ohio-2012.csv", "r")
lines <- readLines(con, 5); close(con); lines
## [1] "cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp"                            
## [2] "C00495820,\"P80000748\",\"Paul, Ron\",\"BROWN, TODD W MR.\",\"BATAVIA\",\"OH\",\"451034017\",\"GENERAL ELECTRIC\",\"ENGINEER\",50,06-DEC-11,\"\",\"\",\"\",\"SA17A\",\"779227\",\"0925592\",\"P2012\","                                 
## [3] "C00495820,\"P80000748\",\"Paul, Ron\",\"DIEHL, MARGO SONJA\",\"CANTON\",\"OH\",\"44718\",\"NONE\",\"RETIRED\",25,06-DEC-11,\"\",\"\",\"\",\"SA17A\",\"779227\",\"0925663\",\"P2012\","                                                  
## [4] "C00495820,\"P80000748\",\"Paul, Ron\",\"KIRCHMEYER, BENJAMIN\",\"NORTH CANTON\",\"OH\",\"44720\",\"DIEBOLD, INC\",\"COMPUTER PROGRAMMER\",201.2,06-DEC-11,\"\",\"\",\"\",\"SA17A\",\"779227\",\"0925696\",\"P2012\","                   
## [5] "C00431445,\"P80003338\",\"Obama, Barack\",\"KEYES, STEPHEN\",\"BEXLEY\",\"OH\",\"432091491\",\"NATIONWIDE MUTUAL INSURANCE CO.\",\"HR EXECUTIVE / ATTORNEY\",100,30-SEP-11,\"\",\"\",\"\",\"SA17A\",\"756218\",\"C12432798\",\"P2012\","

It’s difficult to tell from this output, but it looks like we have an extra empty column at the end of each row. We could use count.fields() but that opens up the entire file. Instead, let’s create a smaller test file to do this.

con <- file("campaign-contributions-ohio-2012.csv", "r")
lines <- readLines(con, 20); close(con)

# using write.csv() here seems to cause problems with escape characters
# using a connection object works better
con <- file("test.csv"); writeLines(lines, con); close(con)
count.fields("test.csv", sep = ",")
##  [1] 18 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19

This confirms that the data following the header somehow has an extra field. Looking back at the results from head and readlines, we can see that the extra field is full of NAs and it’s just an empty field caused by "," at the end of each row.

We can fix this a few different ways. We can set the row names to NULL again, shift the column names to the left, and then remove the last column from the dataframe. Or, we can get the header row by itself and then use it as the column name for the rest of the data without that last empty column. The second options seems easier so we’ll go with that. After, we can finally peak into the correctly structured data to see if we need to make any adjustments (like strings as factors).

## 'data.frame':    19 obs. of  18 variables:
##  $ cmte_id          : Factor w/ 2 levels "C00431445","C00495820": 2 2 2 1 1 1 1 1 1 2 ...
##  $ cand_id          : Factor w/ 2 levels "P80000748","P80003338": 1 1 1 2 2 2 2 2 2 1 ...
##  $ cand_nm          : Factor w/ 2 levels "Obama, Barack",..: 2 2 2 1 1 1 1 1 1 2 ...
##  $ contbr_nm        : Factor w/ 19 levels "ASHMAN, LEWIS J.",..: 4 5 11 10 15 1 12 3 9 14 ...
##  $ contbr_city      : Factor w/ 17 levels "BATAVIA","BEXLEY",..: 1 3 10 2 6 7 5 4 14 8 ...
##  $ contbr_st        : Factor w/ 1 level "OH": 1 1 1 1 1 1 1 1 1 1 ...
##  $ contbr_zip       : int  451034017 44718 44720 432091491 43214 454091226 441201574 452243150 45690 440775592 ...
##  $ contbr_employer  : Factor w/ 17 levels "CH2M HILL","DIEBOLD, INC",..: 6 10 2 9 13 7 12 12 11 4 ...
##  $ contbr_occupation: Factor w/ 15 levels "BUSINESS OWNER",..: 3 11 2 4 6 15 10 11 14 12 ...
##  $ contb_receipt_amt: num  50 25 201 100 50 ...
##  $ contb_receipt_dt : Factor w/ 10 levels "04-AUG-11","06-DEC-11",..: 2 2 2 9 6 7 9 6 1 2 ...
##  $ receipt_desc     : logi  NA NA NA NA NA NA ...
##  $ memo_cd          : logi  NA NA NA NA NA NA ...
##  $ memo_text        : logi  NA NA NA NA NA NA ...
##  $ form_tp          : Factor w/ 1 level "SA17A": 1 1 1 1 1 1 1 1 1 1 ...
##  $ file_num         : int  779227 779227 779227 756218 756218 756218 756218 756218 756218 779227 ...
##  $ tran_id          : Factor w/ 19 levels "0922774","0922801",..: 6 7 8 18 14 16 19 15 10 1 ...
##  $ election_tp      : Factor w/ 1 level "P2012": 1 1 1 1 1 1 1 1 1 1 ...

There are more features that we want as factors than characters so we’ll keep that default setting. Contributor name should be a character, and zipcode should also be a character since it looks like some zipcodes are the full 9 digits and thus we’re gonna have to do some string manipulation. The receipt date should be a date object but we can also set that manually after we read in the data.

Now we can read in the entire contributions file. We’ll have to skip the header again but since we already have the column names from the test dataframe we can set them easily. We add a dummy name to the column names variable to account for the empty entry and then delete that column afterwards.

column_names <- c(column_names, "dummy")
df <- read.csv("campaign-contributions-ohio-2012.csv",
                     header = FALSE, col.names = column_names, skip = 1)
df$dummy <- NULL; head(df)
##     cmte_id   cand_id       cand_nm            contbr_nm  contbr_city
## 1 C00495820 P80000748     Paul, Ron    BROWN, TODD W MR.      BATAVIA
## 2 C00495820 P80000748     Paul, Ron   DIEHL, MARGO SONJA       CANTON
## 3 C00495820 P80000748     Paul, Ron KIRCHMEYER, BENJAMIN NORTH CANTON
## 4 C00431445 P80003338 Obama, Barack       KEYES, STEPHEN       BEXLEY
## 5 C00431445 P80003338 Obama, Barack       MURPHY, MIKE W     COLUMBUS
## 6 C00431445 P80003338 Obama, Barack     ASHMAN, LEWIS J.       DAYTON
##   contbr_st contbr_zip                 contbr_employer
## 1        OH  451034017                GENERAL ELECTRIC
## 2        OH      44718                            NONE
## 3        OH      44720                    DIEBOLD, INC
## 4        OH  432091491 NATIONWIDE MUTUAL INSURANCE CO.
## 5        OH      43214               ROCK TENN COMPANY
## 6        OH  454091226         KETTERING SCHOOL SYSTEM
##         contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc
## 1                ENGINEER              50.0        06-DEC-11             
## 2                 RETIRED              25.0        06-DEC-11             
## 3     COMPUTER PROGRAMMER             201.2        06-DEC-11             
## 4 HR EXECUTIVE / ATTORNEY             100.0        30-SEP-11             
## 5                 MANAGER              50.0        26-SEP-11             
## 6                 TEACHER              50.0        27-SEP-11             
##   memo_cd memo_text form_tp file_num   tran_id election_tp
## 1                     SA17A   779227   0925592       P2012
## 2                     SA17A   779227   0925663       P2012
## 3                     SA17A   779227   0925696       P2012
## 4                     SA17A   756218 C12432798       P2012
## 5                     SA17A   756218 C12223508       P2012
## 6                     SA17A   756218 C12234248       P2012

I’m curious to see how the readr package might have handled this, as it’s supposed to be more intuitive with reading flat files. The only drawback is that we would have to set the factors manually which could be pretty lengthy. In any case, let’s check it out.

df_readr <- read_csv("campaign-contributions-ohio-2012.csv",
                     n_max = 100)
## Warning: 101 parsing failures.
## row col   expected     actual
##   1  -- 18 columns 19 columns
##   2  -- 18 columns 19 columns
##   3  -- 18 columns 19 columns
##   4  -- 18 columns 19 columns
##   5  -- 18 columns 19 columns
## ... ... .......... ..........
## .See problems(...) for more details.
## Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  18 variables:
##  $ cmte_id          : chr  "C00495820" "C00495820" "C00495820" "C00431445" ...
##  $ cand_id          : chr  "P80000748" "P80000748" "P80000748" "P80003338" ...
##  $ cand_nm          : chr  "Paul, Ron" "Paul, Ron" "Paul, Ron" "Obama, Barack" ...
##  $ contbr_nm        : chr  "BROWN, TODD W MR." "DIEHL, MARGO SONJA" "KIRCHMEYER, BENJAMIN" "KEYES, STEPHEN" ...
##  $ contbr_city      : chr  "BATAVIA" "CANTON" "NORTH CANTON" "BEXLEY" ...
##  $ contbr_st        : chr  "OH" "OH" "OH" "OH" ...
##  $ contbr_zip       : int  451034017 44718 44720 432091491 43214 454091226 441201574 452243150 45690 440775592 ...
##  $ contbr_employer  : chr  "GENERAL ELECTRIC" "NONE" "DIEBOLD, INC" "NATIONWIDE MUTUAL INSURANCE CO." ...
##  $ contbr_occupation: chr  "ENGINEER" "RETIRED" "COMPUTER PROGRAMMER" "HR EXECUTIVE / ATTORNEY" ...
##  $ contb_receipt_amt: num  50 25 201 100 50 ...
##  $ contb_receipt_dt : chr  "06-DEC-11" "06-DEC-11" "06-DEC-11" "30-SEP-11" ...
##  $ receipt_desc     : chr  "" "" "" "" ...
##  $ memo_cd          : chr  "" "" "" "" ...
##  $ memo_text        : chr  "" "" "" "" ...
##  $ form_tp          : chr  "SA17A" "SA17A" "SA17A" "SA17A" ...
##  $ file_num         : int  779227 779227 779227 756218 756218 756218 756218 756218 756218 779227 ...
##  $ tran_id          : chr  "0925592" "0925663" "0925696" "C12432798" ...
##  $ election_tp      : chr  "P2012" "P2012" "P2012" "P2012" ...
##  - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 101 obs. of  4 variables:
##   ..$ row     : int  1 2 3 4 5 6 7 8 9 10 ...
##   ..$ col     : chr  NA NA NA NA ...
##   ..$ expected: chr  "18 columns" "18 columns" "18 columns" "18 columns" ...
##   ..$ actual  : chr  "19 columns" "19 columns" "19 columns" "19 columns" ...

We got a warning about the column mismatch but readr handled the discrepancy. We could have done this instead and set the factors one by one but read.csv allows us to set all characters as factors which was advantageous. It took way more effort than readr would have, but it also allowed us to pinpoint the exact problem.

Back to loading the rest of the data. Let’s load the geographic and demographic data.

## 'data.frame':    5403044 obs. of  13 variables:
##  $ long      : num  -88.4 -88.4 -88.4 -88.4 -88.4 ...
##  $ lat       : num  31.1 31.1 31.1 31.1 31.1 ...
##  $ order     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hole      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ piece     : Factor w/ 21 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group     : Factor w/ 42275 levels "0.1","1.1","1.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ id        : chr  "0" "0" "0" "0" ...
##  $ ZCTA5CE10 : Factor w/ 32989 levels "01001","01002",..: 11619 11619 11619 11619 11619 11619 11619 11619 11619 11619 ...
##  $ AFFGEOID10: Factor w/ 32989 levels "8600000US01001",..: 11619 11619 11619 11619 11619 11619 11619 11619 11619 11619 ...
##  $ GEOID10   : Factor w/ 32989 levels "01001","01002",..: 11619 11619 11619 11619 11619 11619 11619 11619 11619 11619 ...
##  $ ALAND10   : num  5.7e+08 5.7e+08 5.7e+08 5.7e+08 5.7e+08 ...
##  $ AWATER10  : num  3241962 3241962 3241962 3241962 3241962 ...
##  $ region    : chr  "36522" "36522" "36522" "36522" ...
## 'data.frame':    33120 obs. of  9 variables:
##  $ region           : chr  "00601" "00602" "00603" "00606" ...
##  $ total_population : num  18450 41302 53683 6591 28963 ...
##  $ percent_white    : num  1 4 2 0 1 0 0 1 2 0 ...
##  $ percent_black    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ percent_asian    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ percent_hispanic : num  99 94 96 100 99 100 100 99 98 100 ...
##  $ per_capita_income: num  7380 8463 9176 6383 7892 ...
##  $ median_rent      : num  285 319 252 230 334 315 285 338 400 319 ...
##  $ median_age       : num  36.6 38.6 38.9 37.3 39.2 38.5 40.9 36.2 42 39.7 ...

For both of these dataframes, the region variable (zipcode) is what we’ll be using to relate to the contributions data. For the demographic data, we can add the fields we want to the contributions dataframe. However, we’ll be making choropleth maps with the geographic data from different aggregate statistics from the contributions dataframe. So we’ll have to concatenate these 2 on the fly depending upon what we want to show.


3 Data Manipulation

Before adding the demographic data, let’s get the contributions dataframe prepared. Let’s get rid of the data we don’t want and rename the columns. There are quite a few features in the contribution data that don’t really help our exploration like the ID number of the candidate or the file number of the contribution. The only features of interest are candidate, name (of the contributor), city, state, zipcode, employer, occupation, amount, and date.

df <- df[, 3:11] # the features we want are consecutive
# rename the features to make them more readable
names(df) <- c("candidate", "name", "city", "state", "zipcode",
                 "employer", "occupation", "amount", "date")
summary(df)
##           candidate                        name                city       
##  Obama, Barack :91286   WADE, ANNIE SANDS    :   114   CINCINNATI: 18596  
##  Romney, Mitt  :50672   DUPHIL, MONIQUE      :    99   COLUMBUS  : 13820  
##  Paul, Ron     : 4271   RINGO, RICHARD A. MR.:    97   DAYTON    :  5536  
##  Santorum, Rick: 2012   POWERS, KAREN        :    90   CLEVELAND :  4748  
##  Gingrich, Newt: 1432   KNEELAND, HAROLD     :    88   TOLEDO    :  3276  
##  Cain, Herman  :  583   RUPPER, DARVIS       :    86   AKRON     :  2965  
##  (Other)       : 1223   (Other)              :150905   (Other)   :102538  
##  state          zipcode      
##  OH:151479   43214  :   655  
##              44122  :   586  
##              44118  :   584  
##              43202  :   511  
##              44107  :   507  
##              45220  :   476  
##              (Other):148160  
##                                    employer    
##  RETIRED                               :34269  
##  SELF-EMPLOYED                         :11503  
##  NOT EMPLOYED                          : 8657  
##  INFORMATION REQUESTED PER BEST EFFORTS: 6049  
##  INFORMATION REQUESTED                 : 4217  
##  (Other)                               :86741  
##  NA's                                  :   43  
##                                   occupation        amount        
##  RETIRED                               :38151   Min.   :-15000.0  
##  INFORMATION REQUESTED PER BEST EFFORTS: 5778   1st Qu.:    25.0  
##  HOMEMAKER                             : 4674   Median :    50.0  
##  PHYSICIAN                             : 4458   Mean   :   202.5  
##  ATTORNEY                              : 4056   3rd Qu.:   150.0  
##  (Other)                               :94351   Max.   : 10000.0  
##  NA's                                  :   11                     
##         date       
##  17-OCT-12:  4176  
##  02-NOV-12:  3465  
##  23-OCT-12:  3066  
##  22-OCT-12:  2569  
##  31-OCT-12:  2547  
##  31-AUG-12:  2534  
##  (Other)  :133122

There’s something interesting here that we can see by leaving the contributor names as factors. There are a few contributors who made nearly 100 contributions. We can also see which zipcodes had the most contributions. This reinforces my thoughts about splitting the data up into 2 different sets, one for individual contributions and another for aggregate contributions per person. If in the end our aim is to build a model for predicting contributions it might be more practical to predict total contributions rather than individual amounts. We can create an aggregate contributions dataset at the end of our data munging process when we have all the other features we want. Then we can compare feature correlations between datasets to see if either one looks more promising for model building.

Let’s coerce those features that we mentioned before to characters and make date into a date object.

to_character <- c("name", "zipcode", "date")

for(col in to_character) {
    df[, col] <- as.character(df[, col])
}

df$date <- as.Date(df$date, format = "%d-%b-%y"); sapply(df, class)
##   candidate        name        city       state     zipcode    employer 
##    "factor" "character"    "factor"    "factor" "character"    "factor" 
##  occupation      amount        date 
##    "factor"   "numeric"      "Date"

Now let’s clean the zipcodes so that they all have just 5 digits, and so that we can relate them to the zipcodes in the geographic and demographic data. We also have to make sure that they are in Ohio.

df$zipcode <- substring(df$zipcode, 1, 5)
# if starts with '45' or '44' or '43' then in Ohio
zip_legit <- as.character(seq(43000, 46000, by = 1))
# convert non-legit zipcodes to 'NA' character
df$zipcode <- ifelse(df$zipcode %in% zip_legit, df$zipcode, 'NA')
summary(as.numeric(df$zipcode))
## Warning in summary(as.numeric(df$zipcode)): NAs introduced by coercion
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   43000   43550   44150   44320   45210   45900     219

Looks like they are all within the correct range with the exception of the NAs. This data won’t be usable for the choropleth maps but it’s still useful for other statistics so we’ll keep it.

We do have negative amounts though, which happen to be refunds, so we’ll have to exclude those observations.

bad <- df$amount <= 0; sum(bad); df <- df[!bad, ]
## [1] 1272

There were 1272 refunds excluded.

Let’s engineer some features in order to make more use of the data that we have. We’ll get gender from name and party status from candidate.

# make first name feature from full name
df$first_name <- substring(str_extract(df$name, ',\\s[A-Z]+'), 3, )
# get dataframe of unique first names and gender
gender_names_df <- gender(unique(df$first_name))
gender_names_df <- gender_names_df[, c("name", "gender")]

# merge dataframe with gender names
df <- merge(df, gender_names_df,
            by.x = 'first_name', by.y = 'name', all.x = TRUE)

# get rid of first name and gender names df, make gender a factor
df$first_name <- NULL; rm(gender_names_df); df$gender <- as.factor(df$gender)
# NA gender count
print("missing gender:"); sum(is.na(df$gender))
## [1] "missing gender:"
## [1] 4677
candidates <- unique(df$candidate)
democrat <- "Obama, Barack"; green <- "Stein, Jill"
republican <- candidates[!(candidates %in% c(democrat, green))]
df$party <- ifelse(df$candidate %in% republican, "republican",
                   ifelse(df$candidate == democrat, "democrat", "green"))
df$party <- as.factor(df$party)

Now we can add demographic data to the dataset and get coordinates for the cities from the web. The cities will help us to see things better on the map.

demographics.ohio <- df_zip_demographics %>% filter(region %in% zip_legit)
df <- merge(df, demographics.ohio,
            by.x = 'zipcode', by.y = 'region', all.x = TRUE)
# shorten the names
colnames(df)[12:19] <- c("population", "pcnt_wht", "pcnt_blk", "pcnt_asn",
                         "pcnt_hsp", "percap_incm", "med_rent", "med_age")
rm(df_zip_demographics)

# get city coordinates
webpage_ohio_cities <-
  read_html("http://www.geonames.org/US/OH/largest-cities-in-ohio.html")
city_names <- webpage_ohio_cities %>%
  xml_find_all("//tr/td/a[contains(@href, 'geonames')]/text()") %>%
  as.character()
city_pop <- webpage_ohio_cities %>%
  xml_find_all("//tr/td[contains(@class, 'rightalign')]/text()") %>%
  as.character() %>%
  sub(",", "", .) %>%
  as.numeric()
city_coord <- webpage_ohio_cities %>%
  xml_find_all('//tr/td/a[contains(@href, "maps")]/text()') %>%
  as.character()
city_lat <- as.numeric(str_extract(city_coord, '^[0-9.]+'))
city_long <- as.numeric(str_extract(city_coord, '[-0-9.]+$'))
mapdata.ohio.cities <- data.frame(city_names, city_long, city_lat, city_pop)

Finally, we subset the map data for Ohio only since it includes the entire United States.

mapdata.ohio <- zip.map %>% filter(region %in% zip_legit) %>% arrange(order)
rm(zip.map)

## The correlations for the aggregate dataframe were only slightly better
## and there were lots of data inconsistencies to make use of occupation
## so this was not included in further analysis.

# Finally, we can create the aggregate contributions dataframe.
# I'll consider anybody with the same name, of the same gender, in the same zipcode, who made
# contributions to the same party, the same person
# (occupation and employer proved to be inconsistent for the same person across different contributions).
if(FALSE) {
df_agg <- df
# strip name of non-alphanumeric chars and whitespace
df_agg$name <- df_agg$name %>% str_replace_all("[^[:alnum:]]", " ") %>% str_trim(side = "both")
# aggregate and set any inconsistencies in occupation to the string with lesser value (better hack?)
df_agg <- df_agg %>%
  group_by(name, gender, zipcode, party, percap_incm, med_age) %>%
  summarise(count = n(), total_amt = sum(amount), occupation = min(as.character(occupation)))
df_agg$occupation <- as.factor(df_agg$occupation)
df_agg <- as.data.frame(df_agg)
# summary and correlations
summary(df_agg); str(df_agg)
library(polycor); hetcor(df_agg[, c(2, 4:8)])
}

# save/load data into .Rdata file for easy start
#save(df, demographics.ohio, mapdata.ohio, mapdata.ohio.cities, file = "processed-data.Rdata")
#load("processed-data.Rdata")

4 Univariate Plots Section

knitr::opts_chunk$set(echo=FALSE, warning=FALSE, messages=FALSE)

Now that the data is in the right format let’s start our univariate exploration.

##  [1] "zipcode"     "candidate"   "name"        "city"        "state"      
##  [6] "employer"    "occupation"  "amount"      "date"        "gender"     
## [11] "party"       "population"  "pcnt_wht"    "pcnt_blk"    "pcnt_asn"   
## [16] "pcnt_hsp"    "percap_incm" "med_rent"    "med_age"

We have 19 different variables,

## 'data.frame':    150207 obs. of  19 variables:
##  $ zipcode    : chr  "43001" "43001" "43001" "43001" ...
##  $ candidate  : Factor w/ 14 levels "Bachmann, Michele",..: 12 7 12 12 12 12 12 12 12 12 ...
##  $ name       : chr  "CHAULK, SARAH" "BAKER, NANCY" "TODD, DARRELL M. MR." "CHAULK, JOSEPH MR." ...
##  $ city       : Factor w/ 1167 levels "ABERDEEN","ADA",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ state      : Factor w/ 1 level "OH": 1 1 1 1 1 1 1 1 1 1 ...
##  $ employer   : Factor w/ 13507 levels "","1099","121 ARW",..: 10417 994 5647 1634 5647 9804 2692 1634 1634 1634 ...
##  $ occupation : Factor w/ 6846 levels "","-","100% DISABLED VIETNAM VETERAN",..: 2594 4148 2960 882 2960 5231 1282 882 882 882 ...
##  $ amount     : num  546 35 100 92.2 125 ...
##  $ date       : Date, format: "2012-10-27" "2012-06-25" ...
##  $ gender     : Factor w/ 2 levels "female","male": 1 1 2 2 2 2 2 2 2 2 ...
##  $ party      : Factor w/ 3 levels "democrat","green",..: 3 1 3 3 3 3 3 3 3 3 ...
##  $ population : num  2295 2295 2295 2295 2295 ...
##  $ pcnt_wht   : num  93 93 93 93 93 93 93 93 93 93 ...
##  $ pcnt_blk   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pcnt_asn   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pcnt_hsp   : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ percap_incm: num  34306 34306 34306 34306 34306 ...
##  $ med_rent   : num  592 592 592 592 592 592 592 592 592 592 ...
##  $ med_age    : num  46.2 46.2 46.2 46.2 46.2 46.2 46.2 46.2 46.2 46.2 ...

with varying classes.

##    zipcode                   candidate         name          
##  Length:150207      Obama, Barack :90588   Length:150207     
##  Class :character   Romney, Mitt  :50198   Class :character  
##  Mode  :character   Paul, Ron     : 4246   Mode  :character  
##                     Santorum, Rick: 2001                     
##                     Gingrich, Newt: 1391                     
##                     Cain, Herman  :  581                     
##                     (Other)       : 1202                     
##          city        state      
##  CINCINNATI: 18446   OH:150207  
##  COLUMBUS  : 13703              
##  DAYTON    :  5488              
##  CLEVELAND :  4697              
##  TOLEDO    :  3261              
##  AKRON     :  2940              
##  (Other)   :101672              
##                                    employer    
##  RETIRED                               :34158  
##  SELF-EMPLOYED                         :11477  
##  NOT EMPLOYED                          : 8648  
##  INFORMATION REQUESTED PER BEST EFFORTS: 6030  
##  INFORMATION REQUESTED                 : 4216  
##  (Other)                               :85635  
##  NA's                                  :   43  
##                                   occupation        amount       
##  RETIRED                               :38029   Min.   :    1.0  
##  INFORMATION REQUESTED PER BEST EFFORTS: 5759   1st Qu.:   25.0  
##  HOMEMAKER                             : 4623   Median :   50.0  
##  PHYSICIAN                             : 4447   Mean   :  210.8  
##  ATTORNEY                              : 4033   3rd Qu.:  150.0  
##  (Other)                               :93305   Max.   :10000.0  
##  NA's                                  :   11                    
##       date               gender             party         population   
##  Min.   :2011-01-28   female:66613   democrat  :90588   Min.   :    0  
##  1st Qu.:2012-07-05   male  :78917   green     :   19   1st Qu.:16076  
##  Median :2012-09-17   NA's  : 4677   republican:59600   Median :25049  
##  Mean   :2012-08-05                                     Mean   :26621  
##  3rd Qu.:2012-10-17                                     3rd Qu.:35078  
##  Max.   :2012-12-27                                     Max.   :68475  
##                                                         NA's   :973    
##     pcnt_wht         pcnt_blk        pcnt_asn         pcnt_hsp     
##  Min.   :  0.00   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 76.00   1st Qu.: 2.00   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 87.00   Median : 4.00   Median : 2.000   Median : 2.000  
##  Mean   : 79.75   Mean   :12.22   Mean   : 2.936   Mean   : 2.842  
##  3rd Qu.: 93.00   3rd Qu.:13.00   3rd Qu.: 4.000   3rd Qu.: 3.000  
##  Max.   :100.00   Max.   :94.00   Max.   :22.000   Max.   :61.000  
##  NA's   :987      NA's   :987     NA's   :987      NA's   :987     
##   percap_incm       med_rent         med_age     
##  Min.   :  864   Min.   : 213.0   Min.   : 6.80  
##  1st Qu.:23951   1st Qu.: 550.0   1st Qu.:36.30  
##  Median :30291   Median : 643.0   Median :39.50  
##  Mean   :32167   Mean   : 668.4   Mean   :39.25  
##  3rd Qu.:38759   3rd Qu.: 749.0   3rd Qu.:43.10  
##  Max.   :67742   Max.   :1475.0   Max.   :83.50  
##  NA's   :1023    NA's   :1735     NA's   :992

The most popular candidate to receive contributions was Barack Obama and the most popular party was the democratic party. The city with the most contributions was Cincinnati and retired people represented the occupation with the largest amount of contributions. The demographic of males made more contributions than females. The median contribution was $50 and the average is $211.

We have long-tailed data for the amount of contributions. There are so many contributions made under $1,000 that it’s hard to see any of the outliers. A logarithmic transformation of the x-axis reveals something of a log-normal distribution for the amount.

There are a significant group of people (4195) that, despite the long-tailed distribution, contribute $2,500. It may be that there is a tax write-off reason for this. Looking closer we can see that there appear to be several discrete values in increments of $50 that people are accustomed to contributing. Under $60, we can see contributions spaced in intervals of $5.

An overwhelming majority of contributions (93.5%) were made in 2012, and we can see a steady increase in contributions leading up to the election in November. There seems to be a slight increase in the amount of contributions made toward the end of the month. There is also a peak at about halfway through the month. People might be making contributions immediately after receiving their paycheck or pension.

Although a significant difference between the amount of male and female contributions, the proportion (0.54 in favor of males) of the gap is not very large.

Most contributions are made to either Obama or Romney. Using a log scale we can see the other candidates a little better. Cumulatively, there are a significant amount of contributions made to other candidates but they are mostly Republican. The proportion of contributions to democrats vs. republicans seems to resemble the previous histrogram between Obama and Romney. The amount of contributions to the green party is so small (19 contributions) that we might want to exclude for simplicity.

It’s important to remember that the remaining demographic variables correspond to the contributor’s zipcode and not to the contributor him/herself.

The distribution of population in which contributors live is fairly normal with mean 2.662110^{4} and median 2.504910^{4}.

Most contributors live in areas with a high percentage of white ethnicity and low percentages of black, asian, or hispanic ethnicities.

The distribution of per capita income in which contributors live appears poisson, with the bulk of contributors living in zipcodes with a range of per capita income of about $20,000 - $40,000. The average is $3.216710^{4} and the median is $3.029110^{4}.

We see a fairly normal distribution with median rent in the locations in which contributors live. Rent is very cheap (median of $643) as compared to California but per capita income is also lower.

The median age in which contributors live also resembles a normal distrubtion with median 39.5 Looking back at the summary statistics we can see that most contributions come from areas where the median age is over 40 (39.5 precisely). I could not find exact statistics for the median age in Ohio in 2012, but in 2015 it appears to be 38.8. We might be tempted to say that contributions come from areas where the population is older which would make sense given the high number of retiree contributions. However whether or not the difference is significant would need some inferential analysis which is beyond the scope of this project.

A choropleth map of total contributions by zipcode shows that contributions generally come from areas nearer to cities. There are also quite a few zipcodes without any contributions (seen with the same background color).


5 Univariate Analysis

5.0.1 What is the structure of your dataset?

There are 151,479 instances of campaign contribution in the dataset with 19 features. From the original data set 11 features were kept or derived:

  • candidate
  • name (contributor)
  • city
  • state
  • zipcode
  • employer
  • occupation
  • amount
  • date
  • gender
  • party

Using the zipcode feature, demographic information was added from another data set:

  • total_population
  • percent_white
  • percent_black
  • percent_asian
  • percent_hispanic
  • per_capita_income
  • median_rent
  • median_age

From the original data set, all but name, amount, and date are factors. None of the factors are ordered. Name is a character, amount is numeric, and date is a date object. The 8 demographic features are all numeric.

Other observations:

  • The largest group of contributors by occupation are retirees
  • About 60% of contributions are made by democrats
  • The median contribution amount is $50 and the maximum is $15,000
  • Barack Obama received 60% of contributions
  • Women accounted for 45% of contributors
  • The number of monthly contributions show an exponential increase as the election approaches

5.0.2 What is/are the main feature(s) of interest in your dataset?

The main features of interest in the data set are amount, gender, party, per capita income, and median age. I would like to see if these factors are correlated with contribution amount. Occupation could be of interest however there are too many levels (6,846).

5.0.3 What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I believe that the percent ethnicities, total population, and median rent may be correlated with contribution amount.

5.0.4 Did you create any new variables from existing variables in the dataset?

I created two new variables, one for the gender of the contributor based upon the first name, and another for the party of the contributor based upon the candidate that received the contribution. I was unable to programatically determine gender by first name for about 4,742 instances (approx. 3% of the data).

5.0.5 Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

A log transformation of contribution amount revealed a log-normal distribution. Despite this, we can see in the non-skewed distribution that there are a significant group of people that donate the maximum allowable campaign contribution by law (approx. $2,600). There are also Political Action Committee (PAC) data in the set which have a larger limit (approx. $5,000). I am unsure of the validity of the outliers beyond this amount because of my limited knowledge of campaign finance law. That there were several negative amounts which needed to be corrected to positive leads me to believe that there could be further inaccuracies in the data set.
The data came in a tidy format and did not need to be transformed.


6 Bivariate Plots Section

Before I plot any bivariate relationships, I want to see a correlation matrix of my variables of interest. This will help to focus the rest of my exploration and cut down on uninformative plots.

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##               amount     gender      party population   pcnt_wht
## amount             1 Polyserial Polyserial    Pearson    Pearson
## gender         0.126          1 Polychoric Polyserial Polyserial
## party         0.4541     0.3664          1 Polyserial Polyserial
## population  -0.05573  -0.003391   -0.03011          1    Pearson
## pcnt_wht     0.03888    0.07609     0.3292   -0.06022          1
## pcnt_blk    -0.03743    -0.0805    -0.3258     -0.006    -0.9711
## pcnt_asn     0.03653    0.01284   -0.03036     0.1826   -0.09309
## pcnt_hsp    -0.03344   0.005547   -0.07443     0.1856    -0.2223
## percap_incm    0.159    0.02015     0.1397   -0.02979     0.2596
## med_rent     0.07272    0.01862    0.06658     0.1581     0.1251
## med_age      0.08183     0.0188     0.2056    -0.1739     0.2695
## time         -0.1117     -0.036   -0.09182    0.03104    0.01332
##               pcnt_blk   pcnt_asn   pcnt_hsp percap_incm   med_rent
## amount         Pearson    Pearson    Pearson     Pearson    Pearson
## gender      Polyserial Polyserial Polyserial  Polyserial Polyserial
## party       Polyserial Polyserial Polyserial  Polyserial Polyserial
## population     Pearson    Pearson    Pearson     Pearson    Pearson
## pcnt_wht       Pearson    Pearson    Pearson     Pearson    Pearson
## pcnt_blk             1    Pearson    Pearson     Pearson    Pearson
## pcnt_asn      -0.08454          1    Pearson     Pearson    Pearson
## pcnt_hsp       0.06317    0.07302          1     Pearson    Pearson
## percap_incm    -0.3128     0.4696    -0.1276           1    Pearson
## med_rent       -0.1987     0.5138   -0.04807      0.7485          1
## med_age        -0.1873    -0.2226    -0.2259      0.3313     0.1663
## time          -0.01758    0.01289   0.006405    0.002254    0.01291
##                med_age       time
## amount         Pearson    Pearson
## gender      Polyserial Polyserial
## party       Polyserial Polyserial
## population     Pearson    Pearson
## pcnt_wht       Pearson    Pearson
## pcnt_blk       Pearson    Pearson
## pcnt_asn       Pearson    Pearson
## pcnt_hsp       Pearson    Pearson
## percap_incm    Pearson    Pearson
## med_rent       Pearson    Pearson
## med_age              1    Pearson
## time         -0.003134          1
## 
## Standard Errors:
##               amount   gender    party population  pcnt_wht pcnt_blk
## amount                                                              
## gender      0.003466                                                
## party       0.003641 0.003804                                       
## population  0.002628 0.003312 0.003334                              
## pcnt_wht    0.002632 0.003286 0.003361   0.002627                   
## pcnt_blk    0.002633 0.003286 0.003523   0.002636 0.0001501         
## pcnt_asn    0.002633 0.003318 0.003351   0.002549  0.002614 0.002618
## pcnt_hsp    0.002633 0.003315  0.00348   0.002546  0.002506 0.002626
## percap_incm  0.00257 0.003311 0.003263   0.002634  0.002459 0.002378
## med_rent    0.002622 0.003316 0.003304    0.00257  0.002595 0.002532
## med_age     0.002619 0.003309 0.003278   0.002557  0.002445 0.002544
## time        0.002604 0.003321 0.003276   0.002634  0.002636 0.002636
##             pcnt_asn pcnt_hsp percap_incm med_rent  med_age
## amount                                                     
## gender                                                     
## party                                                      
## population                                                 
## pcnt_wht                                                   
## pcnt_blk                                                   
## pcnt_asn                                                   
## pcnt_hsp    0.002622                                       
## percap_incm 0.002055 0.002593                              
## med_rent     0.00194  0.00263    0.001159                  
## med_age     0.002506 0.002502    0.002347 0.002564         
## time        0.002636 0.002636    0.002636 0.002636 0.002636
## 
## n = 143873 
## 
## P-values for Tests of Bivariate Normality:
##             amount gender party population pcnt_wht pcnt_blk pcnt_asn
## amount                                                               
## gender           0                                                   
## party            0 0.3947                                            
## population       0      0     0                                      
## pcnt_wht         0      0     0          0                           
## pcnt_blk         0      0     0          0        0                  
## pcnt_asn         0      0     0          0        0        0         
## pcnt_hsp         0      0     0          0        0        0        0
## percap_incm      0      0     0          0        0        0        0
## med_rent         0      0     0          0        0        0        0
## med_age          0      0     0          0        0        0        0
## time             0      0     0          0        0        0        0
##             pcnt_hsp percap_incm med_rent med_age
## amount                                           
## gender                                           
## party                                            
## population                                       
## pcnt_wht                                         
## pcnt_blk                                         
## pcnt_asn                                         
## pcnt_hsp                                         
## percap_incm        0                             
## med_rent           0           0                 
## med_age            0           0        0        
## time               0           0        0       0

Party, percapita income of the contributor’s zipcode, gender, and time are the only variables with a correlation coefficient above 0.10. Time has a negative correlation with amount indicating that despite the increase in contributions leading up to the election, amount of a contribution seems to wane as the election approaches. I’ve never been to Ohio but because of the strong negative correlation between percentage of white and black ethnicity for a contributor’s zipcode, I would think that it is somewhat segregated.

## 
##  One-way analysis of means
## 
## data:  df$amount and as.factor(year(df$date))
## F = 522.96, num df = 1, denom df = 150200, p-value < 2.2e-16

Contributions really ramp up in the summer before the election and the average contribution amount is significantly higher in 2012 than in 2011.

## 
##  One-way analysis of means
## 
## data:  df$amount and df$gender
## F = 1310.8, num df = 1, denom df = 145530, p-value < 2.2e-16

Contributions to the Democratic party came from a slight female majority whereas contributions to the Republican party came from an overwhelming male majority. The mean contribution amount as well as the IQ range is larger for males than females.

## 
##  One-way analysis of means
## 
## data:  df$amount and df$party
## F = 6963.4, num df = 2, denom df = 150200, p-value < 2.2e-16

This definitely resembles the previous plot. Republicans also have a higher mean contribution amount and IQ range than democrats.

## [1] 0.1585001

This is a weak correlation as seen by the points in the plot and the trend is probably due to the higher per capita income demographic around 60K.

## [1] -0.1101269

It’s interesting to see that there is a second peak on this plot. Almost like a final rally in contribution amount just before the election.

Next I want to look at total contribution amounts by party on the map. My guess is that rural areas will show more Republican contributions. The colors for each map correspond to percentile buckets with cuts at 15%, 30%, 45%, 55%, 70%, %85, and 100%.

Both of the maps seem to have the heaviest concentrations of total amounts coming from city areas but it does look like there are more dark colors away from cities for the republican party.


7 Bivariate Analysis

7.0.1 Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Relatively speaking, there were no strong relationship discovered in the numerical correlations. There were however significant correlations (abs. value > 3%) among all of the variables and amount. Per capita income had the strongest correlation (16%) followed by time (numerical date), median age, median rent, total population, percent asian, percent white, percent black, and percent hispanic with the lowest (-3.25%).
Time, total population, percent black, and percent hispanic were all negatively correlated. Among the ordered factors of gender and party, republican contributions were on average higher than democrat, as were male contributions higher than female.

7.0.2 Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I did not expect that time and total population would be correlated with contribution amount. It appears that early on contributions are largest, which might make sense to support a candidate for a longer campaign. Total population seems a bit arbitrary as zipcodes are not necessarily zoned for equal area.

7.0.3 What was the strongest relationship you found?

The strongest relationship among all the variables was that between percent white and percent black of a contributors location. These are negatively correlated at 97%. Among the variables of interest the strongest correlation was between per capita income and amount which I suspected to be so.


8 Multivariate Plots Section

Now I’ll take a look at relationships between amount and a couple of a my highly correlated features from the bivariate exploration. There are a lot of contributions in the dataset and with scatterplots it’s difficult to see the relationships especially since the correlations are weak, so I’ll be using smoothing lines instead.

The first plot of amount by per capita income faceted by gender doesn’t really give us any new information. Both lines for gender have a similar slope and mirror each other. The vertical translation between the lines only highlights what we saw before in the bivariate section with respect to the differences in mean contribution amount across gender. The second plot, however, seems to show that as the per capita income of a contributor’s zipcode increases, republicans make donations in larger amounts as compared to democrats. The high starting point for the LOESS smoothing line in both plots looks like it’s due to some outliers who donated large amounts but who reportedly reside in areas with low per capita income.

As in the previous plot grid, time doesn’t seem to be a factor across gender for contribution amount. Converseley, there is a difference across party, and it looks as if contribution amounts for republicans have 2 peaks. The first peak is somewhat shared for both parties and corresponds to what we saw before with larger contribution amounts in 2011 vs 2012. The second peak is pretty exclusive to the republican party though, and it looks like republicans donate in larger amounts right before the election as compared to democrats.

I want to look at the average contribution by zipcode to see if there is any difference between urban and suburban/rural areas.

It looks like the areas with the highest average contribution amounts aren’t necessarily closest to cities. If 1 on 1 interaction between candidates and people is what gets donations for a campaign, then time would be best spent in these outer city areas.

I want to build a linear model of my most highly correlated features to try and predict amount. I’ll be predicting the log of amount since we saw that the distribution was log-normal. I’ll also add features one by one to my model in order of highest correlation to see if there are improvements in performance.

## 
## Calls:
## m1: lm(formula = log(amount) ~ party, data = model_df)
## m2: lm(formula = log(amount) ~ party + percap_incm, data = model_df)
## m3: lm(formula = log(amount) ~ party + percap_incm + gender, data = model_df)
## m4: lm(formula = log(amount) ~ party + percap_incm + gender + time, 
##     data = model_df)
## 
## ==========================================================================
##                                  m1         m2         m3         m4      
## --------------------------------------------------------------------------
##   (Intercept)                  3.791***   3.316***   3.220***  27.722***  
##                               (0.004)    (0.010)    (0.011)    (0.460)    
##   party: republican/democrat   1.112***   1.074***   1.025***   1.000***  
##                               (0.007)    (0.007)    (0.007)    (0.007)    
##   percap_incm                             0.000***   0.000***   0.000***  
##                                          (0.000)    (0.000)    (0.000)    
##   gender: male/female                                0.206***   0.202***  
##                                                     (0.007)    (0.007)    
##   time                                                         -0.000***  
##                                                                (0.000)    
## --------------------------------------------------------------------------
##   R-squared                         0.2        0.2        0.2        0.2  
##   adj. R-squared                    0.2        0.2        0.2        0.2  
##   sigma                             1.2        1.2        1.2        1.2  
##   F                             27956.3    15536.2    10747.7     8928.6  
##   p                                 0.0        0.0        0.0        0.0  
##   Log-likelihood              -236032.7  -234738.7  -234258.3  -232853.0  
##   Deviance                     221771.5   217836.1   216393.0   212225.7  
##   AIC                          472071.3   469485.4   468526.6   465718.0  
##   BIC                          472100.9   469524.9   468576.0   465777.2  
##   N                            144541     144541     144541     144541    
## ==========================================================================
## Analysis of Variance Table
## 
## Model 1: log(amount) ~ party
## Model 2: log(amount) ~ party + percap_incm
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1 144539 221771                                  
## 2 144538 217836  1    3935.4 2611.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: log(amount) ~ party + percap_incm
## Model 2: log(amount) ~ party + percap_incm + gender
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1 144538 217836                                  
## 2 144537 216393  1    1443.1 963.88 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: log(amount) ~ party + percap_incm + gender
## Model 2: log(amount) ~ party + percap_incm + gender + time
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1 144537 216393                                  
## 2 144536 212226  1    4167.2 2838.1 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
## Call:
## lm(formula = log(amount) ~ party + percap_incm + gender + time, 
##     data = model_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3095 -0.8334 -0.0842  0.7320  4.8197 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.772e+01  4.600e-01   60.26   <2e-16 ***
## partyrepublican  1.000e+00  6.749e-03  148.20   <2e-16 ***
## percap_incm      1.553e-05  2.948e-07   52.66   <2e-16 ***
## gendermale       2.022e-01  6.580e-03   30.72   <2e-16 ***
## time            -1.822e-08  3.421e-10  -53.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.212 on 144536 degrees of freedom
## Multiple R-squared:  0.1981, Adjusted R-squared:  0.1981 
## F-statistic:  8929 on 4 and 144536 DF,  p-value: < 2.2e-16

The analysis of variance tables show that each feature added was significant and helped to improve the fit of the model. Even so, the correlation coefficients for per capita income and time were very small compared to party and gender. This was surprising as per capita income had a stronger correlation to amount than gender. Looking at the residuals it seems as though there is a a kind of slope towards negative error. The standardized residuals are very close to normal which indicates that our linear model to predict … Despite this, the linear model only accounted for about 20% of the variation in donation amount which means that our features are still poor predictors for contribution amount.


9 Multivariate Analysis

9.0.1 Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I looked at amount against all of the significantly correlated features but adding in gender and/or party as a third variable. In nearly all of the comparisons gender and party proved to be a significant in differentiating total contribution amount. Specifically, contribution amount was higher for males than females as was it for Republicans versus Democrats. Also, because of the majority male constituency for Republican doners, we see that male trends generally mirror Republican trends as do female trends mirror Democratic trends.

9.0.2 Were there any interesting or surprising interactions between features?

I knew that gender and party might be significant factors but I did not know to what extent (these features could not be analyzed in the correlation table). I was somewhat surprised to see that differences in gender and party were universal across all other features with respect to contribution amount. An interesting finding was that Democratic contribution amounts show little increase with increasing per capita income of the contributors demographic as compared to Republican contribution amounts. Also, looking at contribution amount over time by party showed that, despite both parties having larger contribution amounts earlier on, Republicans increased their contribution amount leading up to the election whereas Democrats do not. The same is true for males over females but the relationship is less pronounced. This sort of last minute increase in contribution amount reminds me of a type of rally behavior. Whether or not this is effective in catapulting a candidate to nomination is a whole other question altogether but I doubt it to be so (especially since Romney lost Ohio in 2012). Another interesting difference between males and females is that as the median age of the demographic of the contributor increases, male contribution amount tends to increase whereas female contribution shows a slight decrease. The same idea applies for Republicans and Democrats, Republicans showing an increase in amount as the median age of the contributor’s zipcode increases but holding steady for Democrats. If median age of the contributors zipcode did in fact reflect the actual age of the contributor, we could hypothesize that females are less inclined to donate large amounts as they get older.

9.0.3 OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I experimented with several different models and found that one which modeled the log of amount with per capita income, gender, and party was the most effective in explaining variance in contribution amount. I tried to include time as a factor because it was highly negatively correlated with amount however adding this feature only decreased the R-squared value. Adding all of the remaining significantly correlated features had the same effect to decrease the R-squared value.


10 Final Plots and Summary

10.0.1 Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    25.0    50.0   210.8   150.0 10000.0

10.0.2 Description One

The first plot shows both how contribution amounts are distributed log-normally and how contributions increase leading up to an election.

10.0.3 Plot Two

10.0.4 Description Two

This plot grid shows that, in Ohio, most Democratic contributions were made by females and an overwhelming majority of Republican contributions were made by males. Also that, in general, Republican and male contribution amounts are higher than Democrat and female contribution amounts. An interesting rally phenomenon can be seen here with Republican party as the election date approaches.

10.0.5 Plot Three

10.0.6 Description Three

This final plot shows that, despite total contribution amount being highly correlated with city center, average contribution amount is highest surrounding a city and also in some rural areas. This may shed light on why presidential candidates spend a significant amount of time campaigning in suburbs and seemingly rural areas.


11 Reflection

In my investigation of 2012 Presidential Campaign Contributions for the state of Ohio, I chose to focus on finding the most significant features of a contributors information that could be used to predict the actual contribution amount. The most significant features proved to be per capita income of the area in which the contributor lives (which we can assume gives an idea of the contributor him or herself), the gender of the contributor, and the political party affiliation of the contributor (simplified to be either Democrat or Republican). Per capita income of the contributor’s zipcode has a positive correlation with the contributor’s contribution amount. Males have on average higher contribution amounts than females as do Republicans versus Democrats. This would lead one to conclude that given these correlated features, the highest contribution amount could belong to a male Republican who lives in an area with high per capita income. The lowest contribution amount might belong to a contributor who is a female Democrat and who lives in an area with a low per capita income.

Despite these findings, the model that was developed was only able to account for about 18% of the variation in contribution amount. It would have been great to have actual income, age, and ethnicity of the contributor him/herself. I believe that these would have had a much higher correlation than the demographic information of the contributors zipcode. The demographic information was at best a rough approximation of the contributor.

With regard to the choropleth maps, it seemed apparent that total contributions and total contribution amount were highly correlated with city proximity. Average contribution appeared to be higher closer to cities, but with a buffer between the actual city center and high average contribution amounts.

There are several shortcomings of the data set. First, I question the validity of some of the information as several contribution amounts had to be changed from negative to positive. Second, as compared to other zipcodes, some lacked a substantial amount of data. This may have skewed the average contribution choropleth map. Another shortcoming was the inability to programatically determine the gender by first name of the contributor for about 4% of the data. I had a difficult time with regex in R, were I more adept at this, that data might have been included.

If possible, further analysis could include distance, or some measure of proximity, to a city center. The choropleth maps that were generated attempted to show a spatial relationship between amount and cities. This was however at best an approximation without any concrete measurements to back-up the claims/insights. To do this, an average latitude and longitude value could be calculated and added as a variable for each zipcode, and another variable could be added for distance to the closest city.